# Evaluation scripts for MMLU

## Claude
The commands below run Claude models on the MMLU dataset. 

The engine 'claude-v1.3' is the full capacity one, whereas 'claude-instant-v1.0' is the lightweight version.

You can also specify prompt type to either 'single' or 'multiple'. 'single' refers to a group of demonstration questions with answers being supplied as a part of the prompt, whereas 'multiple' refers to having rounds of conversations for demonstration in the prompt.

```bash
cd MMLU
mkdir outputs
API_KEY=<your_api_key>
python run_mmlu_claude.py --anthropic_key=${API_KEY} --engine=claude-v1.3 --prompt_type='multiple'
```


## LLaMA v.s. Falcon

The commands below run open-resource models on the MMLU dataset (answer-only setting, no CoT). 

We have evaluated the LLaMA-65B and Falcon-40B.
The data and the prompts come from the [MMLU original repo](https://github.com/hendrycks/test). We use greedy decoding. For LLaMA we use fp16. For Falcon we evaluate both fp16 and bf16 (and the performance is similar).

```bash
LLAMA_CKPT_DIR=<path to model checkpoints>
PARAM_SIZE=65 # 7, 13, 33, 65
MODEL_TYPE=llama # ["llama", "falcon"] 
python run_mmlu_open_source.py --ckpt_dir ${LLAMA_CKPT_DIR} --param_size ${PARAM_SIZE} --model_type ${MODEL_TYPE}
```


Previously there was a bug incurred by long prompts, resulting LLaMA getting 0 scores on `high_school_european_history` and `high_school_us_history`. Under that [commit](https://github.com/FranxYao/chain-of-thought-hub/tree/9c1fe0f6c2d86706d1b54a8208f807b7181e0b8d/MMLU), LLaMA average score is 61.44.

Now we have updated the code by popping out in-context examples to make the prompt fit into the context length (for us and eu history). After fixing this, the LLaMA score becomes 63.64, basically the same number as the one reported in the paper.

Then using the identical script, Falcon 40B gets 49.08.

Below is the detailed evaluation results of the LLaMA family and Falcon family on each subtask.

|Task|LLaMA-65B|Falcon-40B (FP16)|Falcon-40B (BF16)|LLaMA-33B|LLaMA-13B|LLaMA-7B|Falcon-7B (BF16)|LLaMA-65B (Bug)|
|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
|Average|0.6364|0.4996|0.4996|0.5847|0.4696|0.3522|0.2641|0.6144|
|abstract_algebra|0.3000|0.2900|0.2800|0.3700|0.3000|0.2600|0.2700|0.3000|
|anatomy|0.5778|0.4963|0.4963|0.5185|0.4667|0.3852|0.3185|0.5778|
|astronomy|0.7303|0.4803|0.4868|0.6118|0.4671|0.3421|0.3026|0.7303|
|business_ethics|0.5900|0.5800|0.5800|0.5500|0.4600|0.4100|0.2100|0.5900|
|clinical_knowledge|0.6604|0.5472|0.5509|0.5736|0.4604|0.3396|0.2566|0.6604|
|college_biology|0.6944|0.5556|0.5486|0.5903|0.4722|0.3750|0.2014|0.6944|
|college_chemistry|0.4800|0.2900|0.3000|0.4500|0.3000|0.3000|0.2300|0.4800|
|college_computer_science|0.4700|0.4600|0.4700|0.4600|0.4100|0.3000|0.2600|0.4700|
|college_mathematics|0.3500|0.3200|0.3300|0.3600|0.3700|0.3500|0.3000|0.3500|
|college_medicine|0.5376|0.4162|0.4220|0.5491|0.4162|0.3353|0.2717|0.5376|
|college_physics|0.3529|0.2059|0.2157|0.2549|0.1961|0.2353|0.2059|0.3529|
|computer_security|0.8000|0.5500|0.5600|0.6700|0.6400|0.4500|0.3200|0.8000|
|conceptual_physics|0.5830|0.4213|0.4170|0.5064|0.3830|0.3745|0.2596|0.5830|
|econometrics|0.3947|0.2807|0.2719|0.3596|0.2895|0.2807|0.2368|0.3947|
|electrical_engineering|0.5517|0.5172|0.5172|0.5310|0.4483|0.2207|0.3310|0.5517|
|elementary_mathematics|0.3968|0.3095|0.3069|0.3677|0.2698|0.2646|0.2751|0.3968|
|formal_logic|0.4365|0.2619|0.2540|0.3571|0.3175|0.2619|0.1508|0.4365|
|global_facts|0.3800|0.3300|0.3200|0.3800|0.3600|0.3000|0.3500|0.3800|
|high_school_biology|0.7452|0.6065|0.5935|0.6903|0.5129|0.3323|0.2387|0.7452|
|high_school_chemistry|0.4089|0.3547|0.3448|0.4138|0.2808|0.2759|0.2956|0.4089|
|high_school_computer_science|0.6900|0.5000|0.4900|0.6100|0.5000|0.3400|0.3000|0.6900|
|high_school_european_history|0.7939|0.5394|0.5455|0.7515|0.6182|0.4182|0.3091|0.0000|
|high_school_geography|0.7879|0.6768|0.6616|0.7121|0.5505|0.3384|0.2576|0.7879|
|high_school_government_and_politics|0.8808|0.7098|0.7047|0.8290|0.6684|0.4560|0.1762|0.8808|
|high_school_macroeconomics|0.6564|0.5026|0.5026|0.5667|0.4590|0.3436|0.2077|0.6590|
|high_school_mathematics|0.3444|0.2630|0.2667|0.2778|0.2593|0.2519|0.2407|0.3407|
|high_school_microeconomics|0.6807|0.4706|0.4706|0.5798|0.4748|0.3151|0.2563|0.6765|
|high_school_physics|0.3642|0.3179|0.3113|0.3510|0.3113|0.2781|0.2715|0.3709|
|high_school_psychology|0.8257|0.6514|0.6422|0.7596|0.6128|0.4826|0.2275|0.8257|
|high_school_statistics|0.6157|0.3981|0.4028|0.4907|0.2963|0.3333|0.1991|0.6204|
|high_school_us_history|0.8333|0.6225|0.6225|0.7892|0.5882|0.3676|0.2745|0.0000|
|high_school_world_history|0.8354|0.5316|0.5274|0.8017|0.6751|0.4346|0.2911|0.8354|
|human_aging|0.6726|0.6368|0.6323|0.6771|0.5336|0.3946|0.3453|0.6726|
|human_sexuality|0.7786|0.6336|0.6107|0.6489|0.5725|0.3511|0.2824|0.7786|
|international_law|0.8182|0.6694|0.6612|0.7521|0.6446|0.5124|0.3058|0.8182|
|jurisprudence|0.7407|0.6296|0.6296|0.7130|0.5093|0.4167|0.2778|0.7500|
|logical_fallacies|0.7730|0.5828|0.5828|0.6810|0.5276|0.4172|0.2515|0.7730|
|machine_learning|0.4732|0.2679|0.2857|0.4107|0.3214|0.2768|0.3482|0.4732|
|management|0.8252|0.7476|0.7476|0.7670|0.6602|0.3398|0.2330|0.8252|
|marketing|0.8718|0.7821|0.7778|0.8419|0.7179|0.4615|0.2821|0.8718|
|medical_genetics|0.6800|0.6300|0.6100|0.6600|0.5300|0.3700|0.2700|0.6800|
|miscellaneous|0.8135|0.6909|0.6884|0.7778|0.6475|0.4253|0.2835|0.8135|
|moral_disputes|0.7370|0.5809|0.5925|0.6676|0.5000|0.4075|0.2746|0.7370|
|moral_scenarios|0.4782|0.2872|0.2994|0.4045|0.2961|0.2425|0.2637|0.4771|
|nutrition|0.6895|0.5850|0.5817|0.6242|0.5196|0.3889|0.2549|0.6895|
|philosophy|0.7363|0.6013|0.6045|0.6624|0.5531|0.4051|0.2894|0.7363|
|prehistory|0.7407|0.5833|0.5895|0.6821|0.5185|0.3457|0.2716|0.7377|
|professional_accounting|0.4858|0.4326|0.4326|0.4468|0.3475|0.2660|0.2660|0.4858|
|professional_law|0.4980|0.3768|0.3781|0.4648|0.3709|0.3018|0.2373|0.4935|
|professional_medicine|0.6213|0.4596|0.4632|0.5588|0.5074|0.4412|0.1838|0.6176|
|professional_psychology|0.6667|0.4788|0.4788|0.6373|0.4869|0.3529|0.2794|0.6650|
|public_relations|0.7545|0.6455|0.6545|0.6909|0.6091|0.4000|0.2545|0.7545|
|security_studies|0.7184|0.5755|0.5673|0.6653|0.5306|0.3347|0.2490|0.7224|
|sociology|0.8109|0.7114|0.7114|0.7910|0.6119|0.4577|0.2886|0.8109|
|us_foreign_policy|0.8800|0.7700|0.7700|0.8300|0.7800|0.4300|0.3900|0.8800|
|virology|0.5301|0.4036|0.4217|0.4940|0.4337|0.3434|0.3916|0.5301|
|world_religions|0.8129|0.7953|0.7953|0.7953|0.6784|0.4912|0.3743|0.8129|





